ismayc, @old_man_chester
rudeboybert, @rudeboybert
- ModernDive’s guiding philosophies are deeply intertwined with our syllabi
- Hard to speak of ModernDive in isolation from syllabi and vice versa
- Numbers are numbers, but data has context…
tidyverse, then stats.
Cobb (TAS 2015): Minimizing prerequisites to research. In other words, focus on entirety of Wickham/Grolemund’s pipeline…
… and not just this part.
Furthermore, use the data science tools that a data scientist would use. Example: tidyverse
What does this buy us?
nycflights13 and fivethirtyeight
Cobb (TAS 2015): Two possible “computational engines” for statistics, in particular relating to sampling:
- Mathematics: formulas, probability theory, large-sample approximations, central limit theorem
- Computers: simulations, resampling methods
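As a rough illustration of the two engines, the sketch below estimates the standard error of a sample mean both ways: once with the large-sample formula and once with bootstrap resampling. The simulated data and all object names are assumptions made for illustration; they are not taken from ModernDive.

```r
# Illustrative sketch only: simulated data and hypothetical object names.
set.seed(2017)
sample_values <- rnorm(50, mean = 10, sd = 3)  # a pretend sample of size 50

# Engine 1 (mathematics): large-sample formula for the standard error of the mean
se_formula <- sd(sample_values) / sqrt(length(sample_values))

# Engine 2 (computers): bootstrap, i.e. resample with replacement many times
# and look at how much the resampled means vary
boot_means <- replicate(2000, mean(sample(sample_values, replace = TRUE)))
se_bootstrap <- sd(boot_means)

# The two estimates should land close to each other
c(formula = se_formula, bootstrap = se_bootstrap)
```

The second approach needs no formula at all, only the idea of resampling, which is one reason it can suit a course with minimal prerequisites.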
We present students with a choice for our “engine”:
| Either we use this… | Or we use this… |
|---|---|
- Almost all are thrilled to do the latter
- Leave “bread crumbs” for more advanced math/stats courses
What does this buy us?
Why should we do this?
DataCamp offers an interactive, browser-based tool for learning R/Python. Their two flagship R courses, both of which are free:
Outsource many essential but not fun to teach topics like
ggplot2, dplyr, RMarkdown, and RStudio IDE.
- ggplot2 package and knowledge of the Grammar of Graphics primes students for regression
- broom package to unpack regression
ggplot2 Primes Regression
Example:
This involves four variables: carrier, temp, dep_delay, and summer.
ggplot2 Primes Regression
Why? Dig deeper into data. Look at origin and dest variables as well:
| carrier | origin | dest | Number of Flights |
|---|---|---|---|
| AS | EWR | SEA | 712 |
| F9 | LGA | DEN | 675 |
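A table of this kind can be produced with a short dplyr pipeline. The following is a minimal sketch, assuming the nycflights13 and dplyr packages are installed. The object name route_counts is hypothetical, and the exact counts depend on which rows are filtered out beforehand, so they may not match the table above.

```r
library(dplyr)
library(nycflights13)

# Count flights per carrier/origin/destination for the two carriers.
# route_counts is a hypothetical name; the dep_delay filter mirrors the one
# used in the deck's multiple-regression preview.
route_counts <- flights %>%
  filter(carrier %in% c("AS", "F9")) %>%
  filter(dep_delay < 250) %>%
  count(carrier, origin, dest, name = "Number of Flights")

route_counts
```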
broom Package
- broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.
- tidyverse ecosystem
broom Package
In our case, broom functions take lm objects as inputs and return the following in tidy format!
- tidy(): regression output table
- augment(): point-by-point values (fitted values, residuals, predicted values)
- glance(): scalar summaries like \(R^2\)
broom Package
The chapter will be built around this code:
library(ggplot2)
library(dplyr)
library(nycflights13)
library(knitr)
library(broom)
set.seed(2017)
# Load Alaska data, deleting rows that have missing departure delay
# or arrival delay data
alaska_flights <- flights %>%
  filter(carrier == "AS") %>%
  filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
  sample_n(50)
View(alaska_flights)
# Exploratory Data Analysis----------------------------------------------------
# Plot of sample of points:
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point()
# Correlation coefficient:
alaska_flights %>%
  summarize(correl = cor(dep_delay, arr_delay))
# Add regression line
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")
# Fit Regression and Study Output with broom Package---------------------------
# Fit regression
delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)
# 1. broom::tidy() regression table with confidence intervals and no p-value stars
regression_table <- delay_fit %>%
  tidy(conf.int = TRUE)
regression_table %>%
  kable(digits = 3)
# 2. broom::augment() for point-by-point values
regression_points <- delay_fit %>%
  augment() %>%
  select(arr_delay, dep_delay, .fitted, .resid)
regression_points %>%
  head() %>%
  kable(digits = 3)
# and for prediction
new_flights <- tibble(dep_delay = c(25, 30, 15))
delay_fit %>%
  augment(newdata = new_flights) %>%
  kable()
# 3. broom::glance() scalar summaries of regression
regression_summaries <- delay_fit %>%
  glance()
regression_summaries %>%
  kable(digits = 3)
# Residual Analysis------------------------------------------------------------
ggplot(data = regression_points, mapping = aes(x = .resid)) +
  geom_histogram(binwidth = 10) +
  geom_vline(xintercept = 0, color = "blue")
ggplot(data = regression_points, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 0, color = "blue")
ggplot(data = regression_points, mapping = aes(sample = .resid)) +
  stat_qq()
# Preview of Multiple Regression-----------------------------------------------
flights_subset <- flights %>%
  filter(carrier == "AS" | carrier == "F9") %>%
  left_join(weather, by = c("year", "month", "day", "hour", "origin")) %>%
  filter(dep_delay < 250) %>%
  mutate(summer = ifelse(month == 6 | month == 7 | month == 8,
                         "Summer Flights", "Non-Summer Flights"))
ggplot(data = flights_subset, mapping = aes(x = temp, y = dep_delay, col = carrier)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~summer)
By July 1st, 2017
knitr::kable() output or View() function.